ABSTRACT

Experimental evaluation is a major research methodology for
investigating clustering algorithms. For this purpose, a number of
benchmark datasets have been widely used in the literature, and
their quality plays an important role in the value of the research
work. However, most existing studies pay little attention to the
specific properties of the datasets and often treat them as
black-box problems. In our work, with the help of advanced
visualization and dimension reduction techniques, we show that some
of the popular benchmark datasets used to evaluate clustering
algorithms have potential issues that may seriously compromise
research quality and may even produce completely misleading results.
We suggest that significant efforts need to be devoted to improving
the current practice of experimental evaluation of clustering
algorithms by conducting a principled analysis of each benchmark
dataset of interest.
1 INTRODUCTION

Clustering is one of the fundamental research areas in data science
and has found numerous applications in a wide range of domains
such as e-commerce, customer relationship management, image
processing, and bioinformatics [1, 3, 8]. As an unsupervised learning
paradigm, clustering does not require manually assigning labels
to the original data, which may be very expensive in real-world
scenarios. Instead, clustering aims at automatically exploring the
inherent structure of the datasets to help people acquire an in-depth
appreciation of the key properties of the data.
Similar to the research work on classification, the common practice
for investigating clustering techniques is to conduct empirical
studies in which a set of benchmark datasets is used to quantitatively
evaluate the performance of specific algorithms. As a result, the
quality of the benchmark datasets plays a key role in the validity
and effectiveness of the research outcomes.

To comprehend the current standard of clustering research, we
reviewed a number of representative works, including survey
papers [6, 12, 13] and some recent publications in leading journals
and conferences [14, 15]. Two main categories of datasets are in
use: i) synthetic datasets, which are often of low dimensionality
for illustration purposes; ii) real-world datasets, which can be of
any dimensionality. Normally, one or two synthetic 2D or 3D datasets
are used to demonstrate the procedure and mechanism of the clustering
algorithms, as they are relatively easy to visualize. After that,
real-world datasets come into play to provide further evidence of the
practical performance of the clustering algorithms, as it is often
assumed that as long as real-world datasets are in use, it is
reasonably plausible to make conclusive claims.

Due to their unsupervised nature, clustering algorithms do not
require the datasets to be labeled. However, evaluating them does
require cluster labels as the ground truth against which to judge
the quality of clustering. For instance, for a 2D or 3D dataset, it
is possible for a researcher to visually identify its clustering
pattern (i.e., the number of clusters and the membership of each data
point) and use it as the ground truth. However, for higher
dimensional datasets, it would be very challenging for a researcher
to make the same judgement due to apparent difficulties in
visualization. Consequently, people often choose to use standard
benchmark datasets such as those in the UCI repository [10] that come
with existing labels.

Unfortunately, as this paper will point out, this practice is a
serious flaw in clustering research that has been prevalent for many
years without any sign of abating. We claim that this defective
research methodology has significantly compromised the quality of
clustering research, resulting in inaccurate or completely misleading
results. The key issue is that those labels are defined for
classification purposes, not clustering, and mixing the two scenarios
without any clear justification can produce unpredictable
consequences. For example, a dataset may be created by collecting
some data from male subjects and female subjects, respectively, and
assigning the label “male” or “female” to each corresponding data
record. In such a case, the label can only indicate a property of an
individual data record (i.e., its class), not the distribution of
the entire dataset, which is the main concern of clustering analysis.

The major contribution of our paper is to highlight the importance
of benchmark datasets and raise the alarm about the current
practice of evaluating clustering algorithms. In the next section, we
briefly introduce the performance metrics of clustering algorithms
and visualization techniques to be used in our work. In Section 3
and Section 4, we present two insightful case studies to show why
it is not technically sound to use class labels as the ground truth
for cluster labels. This paper is concluded in Section 5 with further
discussions on the better practice of clustering research.

2 PRELIMINARIES 
2.1 Performance Metrics 

Generally, the performance metrics for clustering can be divided 
into the following two categories: 

1. Internal Criteria: focus on the relationships among clusters, 
such as the compactness of each cluster and the separation between 
clusters, which do not require cluster labels. 

2. External Criteria: focus on the distribution differences between 
clustering results and ground truth, which require the true cluster 
labels. 
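As a minimal sketch of the difference between the two categories, the
following Python code computes one internal quantity (the average
silhouette width) and one external quantity (the plain, unadjusted Rand
index) on a toy dataset. The function names, the toy points and the use
of Euclidean distance are assumptions made for illustration only; the
sketch assumes at least two clusters and distinct points. The point to
note is the signature: the internal criterion needs only the data and a
partition, whereas the external one needs a second label vector to
compare against.

```python
import math


def mean_silhouette(points, labels):
    """Internal criterion: uses only the data and the partition."""
    n = len(points)
    widths = []
    for i in range(n):
        # Accumulate distances from point i to the members of every cluster.
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            # Singleton cluster: its silhouette width is taken as 0.
            widths.append(0.0)
            continue
        a = sums[own] / counts[own]
        b = min(sums[c] / counts[c] for c in sums if c != own)
        widths.append((b - a) / max(a, b))
    return sum(widths) / n


def rand_index(labels_a, labels_b):
    """External criterion: fraction of object pairs on which two labelings agree."""
    n = len(labels_a)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
    return agree / (n * (n - 1) / 2)


# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]

sc = mean_silhouette(points, labels)           # no reference labels needed
ri = rand_index(labels, [1, 1, 1, 0, 0, 0])    # needs reference labels
```

In this toy case the reference labeling is the same partition with the
label names swapped, so the Rand index is 1.0: external criteria compare
partitions, not label names.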

2.1.1 Davies-Bouldin Index (DBI). DBI [2] is an internal criterion
defined as:

\[
\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
\left( \frac{\operatorname{avg}(c_i) + \operatorname{avg}(c_j)}{d_{cen}(\mu_i, \mu_j)} \right)
\tag{1}
\]

where $d_{cen}(\mu_i, \mu_j)$ is the distance between the vectors chosen
as the representatives of clusters i and j, $\operatorname{avg}(c_i)$ and
$\operatorname{avg}(c_j)$ are the dispersions of clusters i and j,
respectively, and k is the number of clusters. Its value ranges over
$[0, +\infty)$, and smaller values indicate more natural partitions of
a dataset.

2.1.2 Silhouette Coefficient (SC). SC [9] is also an internal criterion
where each cluster is represented by a silhouette and the entire
clustering result is presented by combining the silhouettes into a
single plot. The average silhouette width provides an estimate of
clustering validity and is defined as:

\[
\mathrm{SC} = \frac{1}{n} \sum_{i=1}^{n} \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}
\tag{2}
\]

where a(i) is the average dissimilarity of object i to all other
objects of the same cluster, b(i) is the minimum of the average
dissimilarities of object i to the objects of each other cluster, and
n is the number of objects in the dataset. Its value ranges over
[-1, 1], and larger values indicate a more reasonable clustering of
the dataset.

2.1.3 Adjusted Rand Index (ARI). ARI [5] is an external criterion
used to compare two clustering results:

\[
\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]},
\qquad \mathrm{RI} = \frac{a + b}{C_n^2}
\tag{3}
\]

where a is the number of object pairs that are placed in the same
cluster in both partitions, b is the number of object pairs that are
placed in different clusters in both partitions, n is the number of
objects in the data, and $C_n^2$ is the number of pairs that can be
formed in the dataset. Its value ranges over [-1, 1], and larger
values indicate a better fit between the clustering result and the
desired partition of the data.

2.1.4 Normalized Mutual Information (NMI). The Mutual Information
(MI) [11] is a measure that quantifies the mutual dependence
between two random variables, or the information that two random
variables share. In data mining, it can be used to determine
the similarity of two clustering results U and V of a dataset, which
is defined as:

\[
\mathrm{NMI}(U, V) = \frac{\mathrm{MI}(U, V)}{\sqrt{H(U)\,H(V)}}
\tag{4}
\]

where MI(U, V) is the MI between the two partitions and H(U) and
H(V) are their entropy values. Its value ranges over [0, 1], and
larger values indicate a better fit between the two partitions.

2.2 Dimension Reduction

For clustering research, it is an essential yet challenging task to
explore the structure of high-dimensional datasets. With the help of
dimension reduction techniques, it is possible to gain some intuition
about the key features of the data distribution in the original space.
For example, although two separate groups of data may overlap
with each other once projected to a 2D space, two separate groups
of data in a projected 2D space can imply that these two groups of
data are also separate in the higher dimensional space.

The t-SNE algorithm [7] is a powerful tool for dimension reduction,
which converts the distance between two points into a probability.
Similarities in the original space are modeled by a Gaussian
distribution, and similarities in the embedded space are modeled by a
t-distribution. The Kullback-Leibler (KL) divergence between the joint
probability distributions of the original space and the embedded
space is used as the loss function, which is minimized by gradient
descent to obtain the embedding. The KL divergence of the probability
distributions P and Q is defined as:

\[
D_{KL}(P \,\|\, Q) = -\sum_{i} P(i) \ln \frac{Q(i)}{P(i)}
\tag{5}
\]
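To make the loss function used by t-SNE in Section 2.2 more concrete,
the following sketch evaluates the KL divergence of two small discrete
distributions. It covers only the divergence itself, not the t-SNE
optimization; the function name and the example distributions are
hypothetical, and the code assumes Q(i) > 0 wherever P(i) > 0.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = -sum_i P(i) ln(Q(i) / P(i)) for discrete distributions.

    Terms with P(i) = 0 contribute nothing, following the usual convention.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # Same quantity as -pi * ln(qi / pi), written without the minus sign.
            total += pi * math.log(pi / qi)
    return total


# Hypothetical distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

forward = kl_divergence(p, q)
backward = kl_divergence(q, p)
```

The two directions give different values: the divergence is not
symmetric, which is why it matters which distribution plays the role
of P when it is used as a loss.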


3 CASE STUDY 1: OVERLAPPING DATA 

In this case study, we focus on the situation where groups of data
with different class labels may overlap with each other. From a
clustering point of view, this means that some groups may be better
regarded as a single cluster. Consequently, using class labels as the
ground truth for clustering is not appropriate.

For example, Figure 1 (a) shows the data distribution of a 2D
dataset Engytime¹, where objects in different colors belong to
different classes. It is clear that there is some overlapping region
between the two classes, resulting in a single clustering structure
instead of two clusters. Since it is a 2-class dataset, it may be
assumed to consist of two clusters. The result of K-means (k = 2) is
shown in Figure 1 (b), where all objects in the dataset were grouped
into two non-overlapping clusters. We also ran the density-based
clustering algorithm DBSCAN [4], which does not require the number of
clusters as input. As expected, all objects were grouped into a single
cluster by DBSCAN, as shown in Figure 1 (c).
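The behaviour of K-means described above follows from the algorithm
itself: given k = 2, it returns two groups whether or not the data
support them. The sketch below runs a plain Lloyd-style K-means on
points drawn from a single Gaussian blob. It is not the setup used for
Figure 1; the seed, the sample size and the deterministic
initialization (the first and last points in sorted order) are
assumptions chosen to keep the example reproducible, and the function
assumes k >= 2.

```python
import math
import random


def kmeans(points, k, iters=50):
    """Plain Lloyd iterations; returns (labels, centers). Assumes k >= 2."""
    # Deterministic initialization: centers spread over the sorted points.
    ordered = sorted(points)
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# One Gaussian blob, i.e. a single cluster by construction.
rng = random.Random(0)
blob = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]

blob_labels, blob_centers = kmeans(blob, k=2)
sizes = (blob_labels.count(0), blob_labels.count(1))
```

Both entries of `sizes` are non-zero: the single blob is cut into two
parts simply because two clusters were requested.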

¹ The download link of Engytime: https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/engytime.arff

Figure 1: Engytime dataset (a), the clustering result of K-means (b)
and the clustering result of DBSCAN (c).

Table 1: The comparison of clustering results on Engytime

                   DBI     SC      ARI     NMI
Class Label        0.999   0.406   -       -
K-means (k = 2)    0.948   0.425   0.815   0.730
DBSCAN             -       -       0       0.002

To further elucidate the issue, the clustering results of K-means and
DBSCAN on Engytime were evaluated using the four metrics DBI,
SC, ARI and NMI. As shown in Table 1, in terms of internal criteria
(DBI and SC), K-means was slightly better than directly using class
labels to partition the dataset. Meanwhile, in terms of external
criteria (ARI and NMI), the performance of DBSCAN, which only
identified a single cluster, was extremely poor. However, since the
two groups of data overlap with each other, it is more reasonable to
regard the entire dataset as a single cluster instead of two clusters.
Note that DBI and SC only make sense when there are at least two
clusters, so they are not reported for DBSCAN in Table 1.
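The zero ARI of a single-cluster result is a property of the index
rather than of this particular dataset. The sketch below implements the
standard pair-counting form of ARI from a contingency table and applies
it to a hypothetical two-class labeling; it is an illustration under
these assumptions and not necessarily the implementation used to
produce Table 1.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI computed from the contingency table of two labelings."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case: both labelings are trivial and identical in structure.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labels: two classes of five objects each.
class_labels = [0] * 5 + [1] * 5
single_cluster = [0] * 10        # everything grouped into one cluster

ari_single = adjusted_rand_index(class_labels, single_cluster)
ari_same = adjusted_rand_index(class_labels, class_labels)
```

Here `ari_single` is 0.0: against any reference with two or more
classes, a single-cluster partition scores exactly the chance level, no
matter how reasonable it is for the data at hand.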

In addition, we analyzed the 4D Iris² dataset (3 classes) that
often appears in clustering research. By visualizing three of its
dimensions (sepal length, petal length, and petal width) as shown
in Figure 2 (a), we found that the versicolor and virginica classes
may overlap. We further visualized the data in all dimensions
using parallel coordinates as shown in Figure 2 (b), which confirmed
that the two classes do indeed overlap.

² The download link of Iris: https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/real-world/iris.arff

Figure 2: The visualization of the Iris dataset

Table 2: The comparison of clustering results on Iris

                   DBI     SC      ARI     NMI
Class Label        0.751   0.504   -       -
K-means (k = 3)    0.662   0.553   0.730   0.758
DBSCAN             0.383   0.688   0.568   0.734

Similarly, the clustering results of K-means (k = 3) and DBSCAN
on Iris were evaluated, in comparison to using class labels as the
cluster labels. In Table 2, in terms of internal criteria (DBI and
SC), K-means was slightly better than directly regarding the class
labels as the cluster labels. Meanwhile, DBSCAN correctly identified
two clusters and achieved the best scores in terms of DBI and SC.
However, when using class labels as the ground truth for clustering,
it was inferior to K-means.

In summary, when using class labels as the ground truth, we have
found significant inconsistencies among the performance metrics
for the same clustering algorithm, as well as between its
indicated performance and its true performance.

4 CASE STUDY 2: SPLIT DATA 

Another possibility is that objects with the same class label may
correspond to multiple clusters. For example, Figure 3 (a) is the
3D plot of selected objects belonging to the two classes named
Climb_stairs (green) and Descend_stairs (red) in Accelerometer³.
It is clear that objects in the red class are split into roughly two
parts and one of them overlaps with the green class. With k = 2,
K-means created two clusters with one cluster containing objects
from both classes, as shown in Figure 3 (b). Meanwhile, DBSCAN
produced results very similar to those of K-means, as shown in
Figure 3 (c).

In Table 3, the results of K-means and DBSCAN were much better
than directly using class labels as cluster labels in terms of DBI and
SC. However, given class labels as the ground truth, both K-means
and DBSCAN produced very poor ARI and NMI values, which is
inconsistent with the good clustering results shown in
Figure 3. This example demonstrates again that using class labels
as the ground truth for clustering research can be problematic, as
objects with the same class label can be split into separate clusters.
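The effect can be reproduced with a few lines of code. The sketch below
computes NMI for a hypothetical toy labeling that mirrors the situation
described above: one class is split into two parts, and one of the parts
coincides with the other class. The labels are invented for
illustration and are not taken from Accelerometer, and the geometric
mean of the two entropies is assumed as the normalization.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural logarithm) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two labelings of the same objects."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (c / n) * math.log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in joint.items()
    )


def nmi(u, v):
    """MI normalized by the geometric mean of the entropies (assumed variant)."""
    denom = math.sqrt(entropy(u) * entropy(v))
    return mutual_information(u, v) / denom if denom > 0 else 0.0


# Hypothetical toy case: 4 "green" objects and 8 "red" objects.
# Half of the red objects lie in the same region as the green ones.
class_labels = [0] * 4 + [1] * 8
# A clustering that follows the geometry: one cluster per region.
cluster_labels = [0] * 8 + [1] * 4

score = nmi(class_labels, cluster_labels)
```

The resulting score is roughly 0.27, even though the clustering
recovers the two regions exactly; the low value reflects only the
mismatch between class labels and cluster structure.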

³ The download link of Accelerometer: https://archive.ics.uci.edu/ml/datasets/Dataset+for+ADL+Recognition+with+Wrist-worn+Accelerometer

Figure 3: Accelerometer dataset (a), the clustering result of
K-means (b) and the clustering result of DBSCAN (c)

Table 3: The comparison of clustering results on Accelerometer

                   DBI     SC       ARI      NMI
Class Label        2.256   -0.044   -        -
K-means (k = 2)    0.421   0.722    -0.070   0.107
DBSCAN             0.345   0.719    -0.069   0.095

For high-dimensional datasets, objects with the same class label
are more likely to correspond to different clusters due to the sparsity
of data in high-dimensional spaces. Figure 4 shows the 2D plot of
Vertebral⁴ (6D, 2 classes) after dimension reduction using t-SNE.
It shows that objects with class label 1 (green) were split into two
clusters. Although we cannot tell whether some of them overlap
with the objects with class label 0 (red) in the original space, it is
certain that objects with class label 1 are not distributed in the
form of a single cluster.

⁴ The download link of Vertebral: https://archive.ics.uci.edu/ml/datasets/Vertebral+Column



Figure 4: The 2D projection of Vertebral using t-SNE

Table 4: The comparison of clustering results on Vertebral

                   DBI     SC      ARI      NMI
Class Label        1.557   0.108   -        -
K-means (k = 2)    0.911   0.449   0.105    0.255
DBSCAN             0.099   0.858   -0.003   0.004

In Table 4, both K-means (k = 2) and DBSCAN produced better
results than directly using class labels as cluster labels. However, if
the class label information is used as the ground truth for evaluating
clustering algorithms, both algorithms produced low ARI and
NMI values.

5 CONCLUSION 

This paper calls for close attention from the clustering research
community to the current standard of empirical studies. In
particular, we show that it is problematic to directly use
classification datasets in clustering research without any a priori
justification. As shown in the two case studies, using class
information as the ground truth for clustering may produce
contradictory and misleading results.

Instead of arbitrarily choosing a few black-box datasets from
public repositories, it is highly recommended to have a clear
understanding of the structure of the datasets in order to provide at
least a basic level of assurance about their applicability.
Furthermore, due to the challenge of accurately determining the true
cluster labels for a non-trivial real-world dataset, advanced
synthetic datasets with controlled structure may need to be
purposefully generated to better support the principled evaluation of
clustering algorithms.
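As a rough sketch of what a synthetic dataset with controlled structure
could look like, the following code samples isotropic Gaussian clusters
around given centers and returns the generating cluster index of each
point as its label. The function names, the centers and the spread
value are hypothetical; a real benchmark generator would likely need
richer shapes, densities and dimensionalities.

```python
import math
import random


def make_gaussian_clusters(centers, points_per_cluster, spread, seed=0):
    """Sample isotropic Gaussian clusters; returns (points, cluster_labels)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(points_per_cluster):
            points.append(tuple(rng.gauss(c, spread) for c in center))
            labels.append(label)
    return points, labels


def nearest_center_agreement(points, labels, centers):
    """Fraction of points whose nearest center is the one that generated them."""
    hits = 0
    for p, lab in zip(points, labels):
        nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        hits += nearest == lab
    return hits / len(points)


# Hypothetical configuration: three well-separated 2D clusters.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points, labels = make_gaussian_clusters(centers, points_per_cluster=50, spread=0.5)
agreement = nearest_center_agreement(points, labels, centers)
```

Because the labels come from the generating process, they describe
cluster membership by construction. Increasing `spread` relative to the
distance between centers gives direct control over how much the
clusters overlap.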